Text Analytics in R with quanteda (Part 1)
Required Packages
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
library(readr)
library(dplyr)
library(ggplot2)
library(stringr)
library(DT)
library(tidytext)
Understanding Text Analytics Fundamentals
Text analytics transforms unstructured text into structured data suitable for analysis. A typical workflow includes the following steps (a compact end-to-end sketch follows the list):
- Text acquisition and loading: Importing text data from various sources
- Preprocessing: Cleaning and standardizing text
- Tokenization: Breaking text into meaningful units (words, sentences, n-grams)
- Document-Feature Matrix (DFM) creation: Representing text numerically
- Analysis: Extracting insights through statistical and computational methods
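To see how these steps fit together in quanteda before we walk through them one by one, here is a minimal end-to-end sketch (the two example sentences are placeholders, not real data):
# Minimal end-to-end sketch: corpus -> tokens -> DFM -> frequencies
library(quanteda)
library(quanteda.textstats)
docs <- c("Text analytics turns raw text into data.",
          "quanteda makes text analysis straightforward.")
docs %>%
  corpus() %>%                      # text acquisition and loading
  tokens(remove_punct = TRUE) %>%   # preprocessing and tokenization
  tokens_tolower() %>%
  dfm() %>%                         # document-feature matrix
  textstat_frequency() %>%          # analysis: term frequencies
  head()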
Creating a Corpus: Prepping data for analysis
A corpus is a structured collection of texts. quanteda provides the corpus() function to create corpus objects from various data sources.
Example 1: Simple Corpus from Character Vector
# Create a simple corpus from a character vector
texts <- c(
  "The quick brown fox jumps over the lazy dog.",
  "Natural language processing is fascinating and powerful.",
  "Text analytics enables data-driven decision making.",
  "Machine learning algorithms can analyze text at scale.",
  "Data science combines statistics, programming, and domain knowledge."
)
# Create corpus
corp <- corpus(texts)
# Examine the corpus
summary(corp)
Corpus consisting of 5 documents, showing 5 documents:

  Text Types Tokens Sentences
 text1    10     10         1
 text2     8      8         1
 text3     7      7         1
 text4     9      9         1
 text5    10     11         1
Example 2: Corpus from Data Frame
In practice, text data typically comes with associated metadata (e.g., author, date, category). quanteda handles this well:
# Create a data frame with text and metadata
text_df <- data.frame(
text = c(
"Customer service was excellent and responsive.",
"Product quality is poor. Very disappointed.",
"Shipping was fast. Happy with my purchase.",
"Price is too high for the quality received.",
"Great value for money. Would recommend!"
),
rating = c(5, 2, 4, 2, 5),
product_category = c("Electronics", "Clothing", "Electronics", "Clothing", "Electronics"),
review_date = as.Date(c("2025-01-15", "2025-02-20", "2025-03-10", "2025-04-05", "2025-05-12")),
stringsAsFactors = FALSE
)
datatable(text_df)
# Create corpus from data frame
reviews_corp <- corpus(text_df, text_field = "text")
# Examine corpus with metadata
summary(reviews_corp)
Corpus consisting of 5 documents, showing 5 documents:

  Text Types Tokens Sentences rating product_category review_date
 text1     7      7         1      5      Electronics  2025-01-15
 text2     7      8         2      2         Clothing  2025-02-20
 text3     8      9         2      4      Electronics  2025-03-10
 text4     9      9         1      2         Clothing  2025-04-05
 text5     8      8         2      5      Electronics  2025-05-12
# Access document variables (metadata), i.e. the columns not declared as the text_field
docvars(reviews_corp)
  rating product_category review_date
1      5      Electronics  2025-01-15
2      2         Clothing  2025-02-20
3      4      Electronics  2025-03-10
4      2         Clothing  2025-04-05
5      5      Electronics  2025-05-12
# Subset corpus by metadata
high_rated <- corpus_subset(reviews_corp, rating >= 4)
summary(high_rated)
Corpus consisting of 3 documents, showing 3 documents:

  Text Types Tokens Sentences rating product_category review_date
 text1     7      7         1      5      Electronics  2025-01-15
 text3     8      9         2      4      Electronics  2025-03-10
 text5     8      8         2      5      Electronics  2025-05-12
Tokenization: Breaking text into units
Tokenization is the process of splitting text into individual units (tokens), typically words. The tokens() function provides flexible tokenization capabilities.
Basic Tokenization
# Tokenize the reviews corpus
toks <- tokens(reviews_corp)
# View tokens from all documents
print(toks)
Tokens consisting of 5 documents and 3 docvars.
text1 :
[1] "Customer"   "service"    "was"        "excellent"  "and"
[6] "responsive" "."

text2 :
[1] "Product"      "quality"      "is"           "poor"         "."
[6] "Very"         "disappointed" "."

text3 :
[1] "Shipping" "was"      "fast"     "."        "Happy"    "with"     "my"
[8] "purchase" "."

text4 :
[1] "Price"    "is"       "too"      "high"     "for"      "the"      "quality"
[8] "received" "."

text5 :
[1] "Great"     "value"     "for"       "money"     "."         "Would"
[7] "recommend" "!"
Advanced Tokenization Options
quanteda offers fine-grained control over tokenization:
# Create sample text with various elements
sample_text <- "Dr. Smith's email is [email protected]. He earned $100,000 in 2024! Visit https://example.com for more info. #DataScience #AI"
sample_corp <- corpus(sample_text)
# Different tokenization approaches
tokens_default <- tokens(sample_corp)
tokens_no_punct <- tokens(sample_corp, remove_punct = TRUE)
tokens_no_numbers <- tokens(sample_corp, remove_numbers = TRUE)
tokens_no_symbols <- tokens(sample_corp, remove_symbols = TRUE)
tokens_lowercase <- tokens(sample_corp, remove_punct = TRUE) %>%
  tokens_tolower()
# Compare results
print(tokens_default)
Tokens consisting of 1 document.
text1 :
 [1] "Dr"                "."                 "Smith's"
 [4] "email"             "is"                "[email protected]"
 [7] "."                 "He"                "earned"
[10] "$"                 "100,000"           "in"
[ ... and 10 more ]
print(tokens_no_punct)
Tokens consisting of 1 document.
text1 :
 [1] "Dr"                "Smith's"           "email"
 [4] "is"                "[email protected]" "He"
 [7] "earned"            "$"                 "100,000"
[10] "in"                "2024"              "Visit"
[ ... and 6 more ]
print(tokens_no_numbers)
Tokens consisting of 1 document.
text1 :
 [1] "Dr"                "."                 "Smith's"
 [4] "email"             "is"                "[email protected]"
 [7] "."                 "He"                "earned"
[10] "$"                 "in"                "!"
[ ... and 8 more ]
print(tokens_no_symbols)
Tokens consisting of 1 document.
text1 :
 [1] "Dr"                "."                 "Smith's"
 [4] "email"             "is"                "[email protected]"
 [7] "."                 "He"                "earned"
[10] "100,000"           "in"                "2024"
[ ... and 9 more ]
print(tokens_lowercase)
Tokens consisting of 1 document.
text1 :
 [1] "dr"                "smith's"           "email"
 [4] "is"                "[email protected]" "he"
 [7] "earned"            "$"                 "100,000"
[10] "in"                "2024"              "visit"
[ ... and 6 more ]
Stopword Removal
Stopwords are common words (e.g., “the”, “is”, “at”) that typically don’t carry significant meaning. Removing them reduces noise and improves overall efficiency.
# View built-in English stopwords (first 20)
head(stopwords("english"), 20)
 [1] "i"          "me"         "my"         "myself"     "we"
 [6] "our"        "ours"       "ourselves"  "you"        "your"
[11] "yours"      "yourself"   "yourselves" "he"         "him"
[16] "his"        "himself"    "she"        "her"        "hers"
# Count of English stopwords
length(stopwords("english"))
[1] 175
# Remove stopwords from tokens
toks_no_stop <- tokens(reviews_corp,
remove_punct = TRUE,
remove_numbers = TRUE) %>%
tokens_tolower() %>%
tokens_remove(stopwords("english"))
# Compare with and without stopwords
print(tokens(reviews_corp, remove_punct = TRUE)[1])
Tokens consisting of 1 document and 3 docvars.
text1 :
[1] "Customer"   "service"    "was"        "excellent"  "and"
[6] "responsive"
print(toks_no_stop[1])
Tokens consisting of 1 document and 3 docvars.
text1 :
[1] "customer"   "service"    "excellent"  "responsive"
# Count tokens before and after
print(ntoken(tokens(reviews_corp, remove_punct = TRUE)))
text1 text2 text3 text4 text5
6 6 7 8 6
print(ntoken(toks_no_stop))
text1 text2 text3 text4 text5
4 4 4 4 4
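In practice, you may also want to drop corpus-specific noise words that the built-in list does not cover. tokens_remove() accepts any character vector, so extending the list is straightforward; a minimal sketch, where the added words are purely illustrative picks for review data:
# Extend the built-in stopword list with hypothetical domain-specific terms
custom_stops <- c(stopwords("english"), "product", "purchase")
toks_custom <- tokens(reviews_corp, remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(custom_stops)
print(ntoken(toks_custom))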
Stemming
Stemming reduces words to their root form by removing suffixes (e.g., “running” → “run”).
# Example text demonstrating word variations
stem_text <- c(
  "The running runners ran faster than expected.",
  "Computing computers computed complex calculations.",
  "The analyst analyzed analytical data using analysis techniques."
)
stem_corp <- corpus(stem_text)
# Tokenize
stem_toks <- tokens(stem_corp, remove_punct = TRUE) %>%
  tokens_tolower()
# Apply stemming
stem_toks_stemmed <- tokens_wordstem(stem_toks)
# Compare original and stemmed
print(stem_toks[1])
Tokens consisting of 1 document.
text1 :
[1] "the"      "running"  "runners"  "ran"      "faster"   "than"     "expected"
print(stem_toks_stemmed[1])
Tokens consisting of 1 document.
text1 :
[1] "the"    "run"    "runner" "ran"    "faster" "than"   "expect"
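Notice that the irregular form "ran" is left untouched: stemming is rule-based suffix stripping, not true lemmatization. If such forms matter for your analysis, one workaround is tokens_replace() with a custom lookup. This is only a sketch, and the tiny mapping below is illustrative rather than a real lemma dictionary:
# Map an irregular form to a common base with a hypothetical, hand-made lookup
stem_toks_fixed <- tokens_replace(stem_toks_stemmed,
                                  pattern = c("ran"),
                                  replacement = c("run"),
                                  valuetype = "fixed")
print(stem_toks_fixed[1])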
Document-Feature Matrix (DFM): Numerical Representation
A Document-Feature Matrix (DFM) is a numerical representation of text where rows represent documents, columns represent features (typically words), and cell values indicate feature frequency in each document. This structure enables statistical analysis and machine learning applications.
# Create a DFM from our reviews corpus
reviews_dfm <- reviews_corp %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
tokens_tolower() %>%
tokens_remove(stopwords("english")) %>%
dfm()
# Examine the DFM
print(reviews_dfm)
Document-feature matrix of: 5 documents, 19 features (78.95% sparse) and 3 docvars.
features
docs customer service excellent responsive product quality poor disappointed
text1 1 1 1 1 0 0 0 0
text2 0 0 0 0 1 1 1 1
text3 0 0 0 0 0 0 0 0
text4 0 0 0 0 0 1 0 0
text5 0 0 0 0 0 0 0 0
features
docs shipping fast
text1 0 0
text2 0 0
text3 1 1
text4 0 0
text5 0 0
[ reached max_nfeat ... 9 more features ]
# View DFM dimensions
print(dim(reviews_dfm))
[1] 5 19
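Because many modelling and plotting functions outside quanteda expect base R structures, it is useful to know that a DFM converts cleanly. A minimal sketch using quanteda's convert():
# Convert the sparse DFM into a plain data frame for use outside quanteda
reviews_df <- convert(reviews_dfm, to = "data.frame")
head(reviews_df[, 1:6])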
Feature Statistics: Understanding Word Frequencies
Analyzing feature frequencies reveals the most important terms in the corpus.
# Calculate feature frequencies
feat_freq <- textstat_frequency(reviews_dfm)
# View top features
head(feat_freq, 15)
        feature frequency rank docfreq group
1       quality         2    1       2   all
2      customer         1    2       1   all
3       service         1    2       1   all
4     excellent         1    2       1   all
5    responsive         1    2       1   all
6       product         1    2       1   all
7          poor         1    2       1   all
8  disappointed         1    2       1   all
9      shipping         1    2       1   all
10         fast         1    2       1   all
11        happy         1    2       1   all
12     purchase         1    2       1   all
13        price         1    2       1   all
14         high         1    2       1   all
15     received         1    2       1   all
# Visualize top features
feat_freq %>%
head(15) %>%
ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Top 15 Most Frequent Terms",
x = "Term",
y = "Frequency") +
theme_minimal()

Group-Based Feature Analysis
Analyzing features by groups (e.g., high vs. low ratings) reveals distinctive vocabulary:
# Group DFM by rating category
reviews_dfm_grouped <- reviews_corp %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
tokens_tolower() %>%
tokens_remove(stopwords("english")) %>%
dfm() %>%
dfm_group(groups = rating)
# Calculate frequencies by group
freq_by_rating <- textstat_frequency(reviews_dfm_grouped, groups = rating)
# View top features for each rating
print(freq_by_rating %>% filter(group == 5) %>% head(10))
      feature frequency rank docfreq group
12   customer         1    1       1     5
13    service         1    1       1     5
14  excellent         1    1       1     5
15 responsive         1    1       1     5
16      great         1    1       1     5
17      value         1    1       1     5
18      money         1    1       1     5
19  recommend         1    1       1     5
print(freq_by_rating %>% filter(group == 2) %>% head(10))
       feature frequency rank docfreq group
1      quality         2    1       1     2
2      product         1    2       1     2
3         poor         1    2       1     2
4 disappointed         1    2       1     2
5        price         1    2       1     2
6         high         1    2       1     2
7     received         1    2       1     2
# Visualize comparison
freq_by_rating %>%
filter(group %in% c(2, 5)) %>%
group_by(group) %>%
slice_max(frequency, n = 8) %>%
ungroup() %>%
mutate(feature = reorder_within(feature, frequency, group)) %>%
ggplot(aes(x = feature, y = frequency, fill = factor(group))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ group, scales = "free_y", labeller = labeller(group = c("2" = "2-Star Reviews", "5" = "5-Star Reviews"))) +
scale_x_reordered() +
coord_flip() +
labs(title = "Top Terms by Review Rating",
x = "Term",
y = "Frequency") +
theme_minimal()

Word Clouds: Visual Exploration
Word clouds provide intuitive visualization of term frequencies:
# Create word cloud
set.seed(123)
textplot_wordcloud(reviews_dfm,
min_count = 1,
max_words = 50,
rotation = 0.25,
color = RColorBrewer::brewer.pal(8, "Dark2"))

N-grams: Multi-Word Expressions
N-grams are contiguous sequences of n tokens. Bigrams (2-grams) and trigrams (3-grams) capture multi-word expressions and phrases that single words miss.
# Create sample text for n-gram analysis
ngram_text <- c(
  "Machine learning and artificial intelligence are transforming data science.",
  "Natural language processing enables text analytics at scale.",
  "Deep learning models achieve state of the art results.",
  "Data science requires domain knowledge and technical skills.",
  "Text mining extracts insights from unstructured data."
)
ngram_corp <- corpus(ngram_text)
# Create bigrams
bigrams <- ngram_corp %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_ngrams(n = 2) %>%
  dfm()
# Calculate bigram frequencies
bigram_freq <- textstat_frequency(bigrams)
head(bigram_freq, 15)
                   feature frequency rank docfreq group
1             data_science         2    1       2   all
2         machine_learning         1    2       1   all
3             learning_and         1    2       1   all
4           and_artificial         1    2       1   all
5  artificial_intelligence         1    2       1   all
6         intelligence_are         1    2       1   all
7         are_transforming         1    2       1   all
8        transforming_data         1    2       1   all
9         natural_language         1    2       1   all
10     language_processing         1    2       1   all
11      processing_enables         1    2       1   all
12            enables_text         1    2       1   all
13          text_analytics         1    2       1   all
14            analytics_at         1    2       1   all
15                at_scale         1    2       1   all
# Visualize top bigrams
bigram_freq %>%
head(10) %>%
ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
geom_col(fill = "indianred") +
coord_flip() +
labs(title = "Top 10 Bigrams",
x = "Bigram",
y = "Frequency") +
theme_minimal()

# Create trigrams
trigrams <- ngram_corp %>%
tokens(remove_punct = TRUE) %>%
tokens_tolower() %>%
tokens_ngrams(n = 3) %>%
dfm()
# Calculate trigram frequencies
trigram_freq <- textstat_frequency(trigrams)
# Visualize top trigrams
trigram_freq %>%
head(10) %>%
ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Top 10 Trigrams",
x = "Trigram",
y = "Frequency") +
theme_minimal()

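Many of the bigrams and trigrams above are dominated by stopwords ("learning_and", "and_artificial"). A common refinement is to remove stopwords before forming the n-grams so that only content-word pairs remain; a minimal sketch (note that this joins words across the removed stopwords, so interpret the resulting pairs with that caveat in mind):
# Remove stopwords before forming bigrams to keep content-word pairs only
content_bigrams <- ngram_corp %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("english")) %>%
  tokens_ngrams(n = 2) %>%
  dfm()
head(textstat_frequency(content_bigrams), 10)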
Real-World Example: Analyzing Customer Feedback
Let’s apply these techniques to a more realistic scenario.
# Create a realistic customer feedback dataset
set.seed(456)
feedback_df <- data.frame(
text = c(
"Absolutely love this product! Best purchase I've made all year. Quality is outstanding.",
"Terrible experience. Product broke after one week. Customer service was unhelpful.",
"Good value for the price. Works as expected. Would buy again.",
"Shipping took forever. Product is okay but not worth the wait.",
"Amazing quality and fast delivery. Highly recommend to everyone!",
"Product description was misleading. Not what I expected at all.",
"Decent product but customer support needs improvement. Long wait times.",
"Exceeded my expectations! Great features and easy to use.",
"Poor quality control. Received damaged item. Return process was difficult.",
"Perfect! Exactly what I needed. Five stars all around.",
"Overpriced for what you get. Better alternatives available elsewhere.",
"Good product but instructions were confusing. Setup took hours.",
"Love it! Works perfectly and looks great too.",
"Not satisfied. Product feels cheap and flimsy.",
"Best customer service ever! They resolved my issue immediately.",
"Average product. Nothing special but gets the job done.",
"Fantastic! Will definitely purchase from this company again.",
"Disappointed with the quality. Expected much better.",
"Great features but battery life is poor.",
"Excellent value. Highly recommend for budget shoppers."
),
rating = c(5, 1, 4, 2, 5, 1, 3, 5, 1, 5, 2, 3, 5, 2, 5, 3, 5, 2, 3, 4),
category = sample(c("Electronics", "Home & Kitchen", "Clothing"), 20, replace = TRUE),
helpful_votes = sample(0:50, 20, replace = TRUE),
stringsAsFactors = FALSE
)
# Create corpus
feedback_corp <- corpus(feedback_df, text_field = "text")
# Complete preprocessing pipeline
feedback_dfm <- feedback_corp %>%
tokens(remove_punct = TRUE,
remove_numbers = TRUE,
remove_symbols = TRUE) %>%
tokens_tolower() %>%
tokens_remove(stopwords("english")) %>%
tokens_wordstem() %>%
dfm()
# Analyze overall sentiment-related terms
sentiment_terms <- c("love", "best", "great", "excel", "amaz", "perfect",
"terribl", "poor", "worst", "disappoint", "bad")
sentiment_dfm <- dfm_select(feedback_dfm, pattern = sentiment_terms)
# Calculate sentiment term frequencies
sentiment_freq <- textstat_frequency(sentiment_dfm)
# Visualize sentiment terms
ggplot(sentiment_freq, aes(x = reorder(feature, frequency), y = frequency)) +
geom_col(aes(fill = feature), show.legend = FALSE) +
coord_flip() +
labs(title = "Frequency of Sentiment-Related Terms",
subtitle = "Customer Feedback Analysis",
x = "Term (Stemmed)",
y = "Frequency") +
theme_minimal() +
scale_fill_manual(values = c(
"love" = "darkgreen", "best" = "darkgreen", "great" = "darkgreen",
"excel" = "darkgreen", "amaz" = "darkgreen", "perfect" = "darkgreen",
"terribl" = "darkred", "poor" = "darkred", "worst" = "darkred",
"disappoint" = "darkred", "bad" = "darkred"
))

# Compare high vs low-rated reviews
# Create a rating category variable
rating_category <- ifelse(docvars(feedback_corp, "rating") >= 4, "High Rating", "Low Rating")
feedback_grouped <- feedback_corp %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
tokens_tolower() %>%
tokens_remove(stopwords("english")) %>%
dfm() %>%
dfm_group(groups = rating_category)
# Calculate keyness (distinctive terms)
keyness_stats <- textstat_keyness(feedback_grouped, target = "High Rating")
# Visualize keyness
textplot_keyness(keyness_stats, n = 10, color = c("darkgreen", "darkred")) +
labs(title = "Distinctive Terms: High vs Low Ratings") +
theme_minimal()

Document Similarity Analysis
Understanding document similarity is crucial for tasks like duplicate detection, document clustering, and recommendation systems:
# Use the preprocessed DFM for better similarity measurement
feedback_dfm_clean <- feedback_corp %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
tokens_tolower() %>%
tokens_remove(stopwords("english")) %>%
tokens_wordstem() %>%
dfm()
# Calculate document similarity using cosine similarity
doc_similarity <- textstat_simil(feedback_dfm_clean,
method = "cosine",
margin = "documents")
# Find most similar documents to first review (a positive review)
similarity_df <- as.data.frame(as.matrix(doc_similarity))
similarity_to_doc1 <- sort(as.numeric(similarity_df[1, ]), decreasing = TRUE)[2:6] # Skip first (itself)
# First review document
as.character(feedback_corp)[1]
text1 "Absolutely love this product! Best purchase I've made all year. Quality is outstanding."
# Top 3 similar documents
for(i in 2:4) {
doc_idx <- order(as.numeric(similarity_df[1, ]), decreasing = TRUE)[i]
cat("\nDocument", doc_idx, "(Similarity:", round(similarity_df[1, doc_idx], 3), "):\n")
cat(as.character(feedback_corp)[doc_idx], "\n")
}
Document 6 (Similarity: 0.167 ):
Product description was misleading. Not what I expected at all.

Document 17 (Similarity: 0.167 ):
Fantastic! Will definitely purchase from this company again.

Document 13 (Similarity: 0.149 ):
Love it! Works perfectly and looks great too.
It is worth noting that cosine similarity on raw word counts may not be enough: the most similar document here does not even match the first review’s sentiment.
Sentiment-Aware Similarity
To improve similarity assessment we can incorporate sentiment. Let’s add sentiment score as a feature.
# Define positive and negative sentiment lexicons
positive_words <- c("love", "best", "great", "excellent", "amazing", "perfect",
"fantastic", "outstanding", "happy", "wonderful", "superb")
negative_words <- c("terrible", "poor", "worst", "disappointing", "bad", "awful",
"horrible", "useless", "disappointed", "misleading", "cheap")
# Create tokens
feedback_toks <- feedback_corp %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
tokens_tolower() %>%
tokens_remove(stopwords("english"))
# Calculate sentiment scores for each document
sentiment_scores <- sapply(feedback_toks, function(doc_tokens) {
pos_count <- sum(doc_tokens %in% positive_words)
neg_count <- sum(doc_tokens %in% negative_words)
# Net sentiment score
(pos_count - neg_count) / length(doc_tokens)
})
# Create DFM with sentiment features
feedback_dfm_sentiment <- feedback_toks %>%
tokens_wordstem() %>%
dfm()
# Add sentiment score as a weighted feature
# Create a sentiment feature by replicating the sentiment score
sentiment_feature_matrix <- matrix(sentiment_scores * 10, # Scale up for visibility
nrow = ndoc(feedback_dfm_sentiment),
ncol = 1,
dimnames = list(docnames(feedback_dfm_sentiment),
"SENTIMENT_SCORE"))
# Combine with original DFM
feedback_dfm_with_sentiment <- cbind(feedback_dfm_sentiment, sentiment_feature_matrix)
# Calculate similarity with sentiment
doc_similarity_sentiment <- textstat_simil(feedback_dfm_with_sentiment,
method = "cosine",
margin = "documents")
# Compare results
similarity_df_sentiment <- as.data.frame(as.matrix(doc_similarity_sentiment))
#Standard vs Sentiment-Aware Similarity
# First document
cat(as.character(feedback_corp)[1], "\n")
Absolutely love this product! Best purchase I've made all year. Quality is outstanding.
# Sentiment-aware similarity
for(i in 2:4) {
doc_idx <- order(as.numeric(similarity_df_sentiment[1, ]), decreasing = TRUE)[i]
cat("\nDocument", doc_idx, "(Similarity:", round(similarity_df_sentiment[1, doc_idx], 3), "):\n")
cat(as.character(feedback_corp)[doc_idx], "\n")
cat("Rating:", docvars(feedback_corp, "rating")[doc_idx],
"| Sentiment:", round(sentiment_scores[doc_idx], 3), "\n")
}
Document 13 (Similarity: 0.697 ):
Love it! Works perfectly and looks great too.
Rating: 5 | Sentiment: 0.4

Document 17 (Similarity: 0.65 ):
Fantastic! Will definitely purchase from this company again.
Rating: 5 | Sentiment: 0.25

Document 5 (Similarity: 0.427 ):
Amazing quality and fast delivery. Highly recommend to everyone!
Rating: 5 | Sentiment: 0.143
Feature Co-occurrence Analysis
Understanding which words frequently appear together can help reveal semantic relationships.
# Create feature co-occurrence matrix
fcm <- feedback_corp %>%
tokens(remove_punct = TRUE) %>%
tokens_tolower() %>%
tokens_remove(stopwords("english")) %>%
fcm()
# Extract a subset of the co-occurrence matrix (first 20 features)
feat_cooc <- fcm[1:20, 1:20]
# Visualize semantic network
set.seed(123)
textplot_network(fcm,
min_freq = 2,
edge_alpha = 0.5,
edge_size = 2,
vertex_labelsize = 3) +
labs(title = "Semantic Network of Customer Feedback")

DFM Manipulation and Transformation
quanteda provides powerful functions for manipulating DFMs:
# Trim DFM to remove rare and very common features
feedback_dfm_trimmed <- dfm_trim(feedback_dfm,
min_termfreq = 2, # Remove terms appearing < 2 times
max_docfreq = 0.8, # Remove terms in > 80% of docs
docfreq_type = "prop")
# Original DFM dimensions
print(dim(feedback_dfm))
[1] 20 95
# Trimmed DFM dimensions
print(dim(feedback_dfm_trimmed))
[1] 20 22
# Weight DFM using TF-IDF
feedback_tfidf <- dfm_tfidf(feedback_dfm_trimmed)
# Top features by TF-IDF
tfidf_freq <- textstat_frequency(feedback_tfidf, force = TRUE)
print(head(tfidf_freq, 15))
    feature frequency rank docfreq group
1   product  3.183520    1       8   all
2   qualiti  2.795880    2       4   all
3    expect  2.795880    2       4   all
4    custom  2.471726    4       3   all
5     great  2.471726    4       3   all
6      love  2.000000    6       2   all
7      best  2.000000    6       2   all
8   purchas  2.000000    6       2   all
9    servic  2.000000    6       2   all
10     good  2.000000    6       2   all
11     valu  2.000000    6       2   all
12     work  2.000000    6       2   all
13     took  2.000000    6       2   all
14     wait  2.000000    6       2   all
15     high  2.000000    6       2   all
# Select specific features
# Note: feedback_dfm is stemmed, so the unstemmed patterns "service", "response",
# and "customer" will not match their stemmed forms ("servic", "respons", "custom")
service_terms <- c("service", "support", "help", "response", "customer")
service_dfm <- dfm_select(feedback_dfm, pattern = service_terms)
# Frequency of service-related terms
print(colSums(service_dfm))
support
1
# Remove specific features
filtered_dfm <- dfm_remove(feedback_dfm, pattern = c("product", "item"))
# Original feature count
nfeat(feedback_dfm)
[1] 95
# Filtered feature count
nfeat(filtered_dfm)
[1] 93